Pharmacogenetic allele variant frequencies: An analysis of the VA’s Million Veteran Program (MVP) as a representation of the diversity in US population

We present allele frequencies of pharmacogenomics relevant variants across multiple ancestry in a sample representative of the US population. We analyzed 658,582 individuals with genotype data and extracted pharmacogenomics relevant single nucleotide variant (SNV) alleles, human leukocyte antigens (HLA) 4-digit alleles and an important copy number variant (CNV), the full deletion/duplication of CYP2D6. We compiled distinct allele frequency tables for European, African American, Hispanic, and Asian ancestry individuals. In addition, we compiled allele frequencies based on local ancestry reconstruction in the African-American (2-way deconvolution) and Hispanic (3-way deconvolution) cohorts.


Introduction
Genetic polymorphisms of metabolic pathways and cytochrome P450 (CYP) genes are associated with altering pharmacokinetics and pharmacodynamics of the absorption, distribution, metabolism and excretion (ADME) of drug and toxic compounds (xenobiotics). Gaining a better understanding of the interindividual variations of this genetic makeup is necessary to understand the metabolic rate of efficiency a xenobiotic is metabolized. In general, heritable selective pressure is a major determinant of variant frequency among the different ethnic populations, typically presenting with two or more variants identified in most metabolic pathway genes. Common star allelic variants (referred to herein as "variant") prescribe a "normal" metabolic cycle while others convey a heightened or depressed metabolic cycle. Much of this is well catalogued in a variety of collections, including PharmGKB (https://www.pharmgkb.org/ ), with clinically actionable variant-vs-drug combinations presented in the Clinical Pharmacogenetics Implementation Consortium (CPIC https://cpicpgx.org/) and the Dutch Pharmacogenetics Working Group (DPWG http://upgx.eu/guidelines). While these variants are catalogued in the multitude of databases, it is also important to recognize that many of the variants identified heavily rely upon data derived from unique ethnic populations. Ethnic population data are typically derived from a limited collection of self-identified subjects and the unique variants associated within that ethnicity. Moreover, certain variants designated as normal are unique to a select ethnicity and not represented among others, such as is known for CYP2D6 and codeine, or CYP3A5 and Tacrolimus [1]. Lastly, while certain populations are considered relatively homogenous over several generations as dictum of culture, not all data is reflective of this consideration which further contributes to the diversity of drug responses. The distribution of inherited xenobiotic-metabolizing alleles differs considerably between populations [2,3] and appears to be rigid in frequency among ethnically stable populations. While there are several large-scale data sets that can provide variant frequencies of pharmacogenomic genes for researchers and clinicians to use, most of these data are not representative of the "melting pot" of the genetic ancestries present within the United States. (US). While one could rely on self-reported ethnicity to improve variant frequency found among unique populations, such data is imperfect [4]. Further complicating possible variant predictions is that nearly everyone has at least one pharmacogenomic variant allele with as many as 3% carrying 5 allelic variants [5]. These findings limit the overarching use of ethnically related variant frequencies in diverse populations such as is present in the US. This is because the data available is limited to a specific self-reported ethnicity and/or does not consider other variants that could be present among other ethnicities. This is a particularly important consideration for research of personalized drug therapy and potentially changes the healthcare guidelines provided by groups such as CPIC and DPWG that select important alleles for clinical genotyping based in part on population prevalence.
To address the issues with the use of pharmacogenomic variants and to further explore possibly pharmacogenetically-associated variant markers we used the Million Veteran Program (MVP) [6] with >800,000 participants to generate a coherent representation of allelic frequencies present within a US population. The MVP cohort is mostly male but is very diverse and represents the US population ancestry in general. The genotype data was imputed using the African Genome Resources (AGR) and 1000 Genomes imputation reference panels.

Results
Our analysis is based on the Release 4 of MVP data with 658,582 individuals genotyped with the MVP-1 Axiom array [7]. Participants were assigned ancestry based on the HARE algorithm (Harmonized Ancestry and Race/Ethnicity) [8]. The MVP cohort is diverse with~30% of the cohort assigned as non-European (EUR 467k, AFR 125k, HIS 52k, ASN 8k). A small fraction of the cohort was highly admixed and not assigned to any of the four major ancestries and is not included in this analysis (<2%).
Our aim is to provide ancestry specific variant frequency catalog for a significant fraction of pharmacogenomics relevant variants in a large cohort representative of the US population. We examined Single Nucleotide Variants (SNV), an important pharmacogenetics relevant Copy Number Variant (CNV) as well as Human Leukocyte Antigen (HLA) 4-digit alleles. We defined our pharmacogenomics gene set by combining information from two publicly available data bases, PharmGKB and PharmVar (Methods). Overall, we were able to determine SNV frequencies for 273/1339 targeted SNVs, in 148/152 targeted genes. Details on variant selection can be found in Methods. S1 File provides a comprehensive table of all allele frequencies. As expected, SNV allele frequencies vary substantially among HARE groups (Fig 1).
In addition to HARE allele frequencies, we used Local Ancestry Inference (LAI) to identify ancestral origin of individual chromosomal segments and compute allele frequencies based on the local ancestry. We "painted" the African American samples (125 k individuals) using twoway deconvolution, extracting allele frequencies for the AFR and EUR tracks. For the Hispanic individuals (52 k) we used three-way deconvolution to compute allele frequencies for EUR, AFR and Native American (AMR) tracks. Details on the LAI will be presented elsewhere. In Fig 2 we present allele frequencies derived from HARE groups, LAI as well as two publicly available databases, 1 k genome and gnomAD (Methods). The most striking differences are observed for Hispanics, a group that is extremely heterogeneous and not well defined in the genetics literature.
Allele frequencies for the three major MVP HARE groups (EUR, AFR, HIS) are in good agreement with gnomAD derived estimates. However, comparison of gnomAD HIS frequencies with the LAI AMR track of the MVP HIS population reveals significant differences (S1 Fig). Here we note that the LAI AMR track provides much better allele frequency

PLOS ONE
Pharmacogenetic allele variant frequencies MVP ascertainment than the 60 AMR genomes that were used to anchor the local ancestry deconvolution. In the MVP HARE HIS population (52k individuals) the AMR track contributes~30% of the genome resulting in an effective population size of~15k individuals. Furthermore, while the 60 AMR genomes we used to anchor ancestry deconvolution provide sufficient multi-locus information to resolve local ancestry, the AMR track is a much better sampling of the AMR genome as it exists today in the US population.
Sirolimus is a widely used immunosuppressant and the variant controlling its metabolism, rs2242480 (allele CYP3A4 � 1G) [10], varies among populations. We observe widely different allele frequencies in the three major groups (EUR, AFR, HIS) for gnomAD (0.09, 0.74, 0.37) and MVP (0.09, 0.73, 0.35). However, the local ancestry derived AMR allele frequency (0.67) is almost twice as high as the HIS allele frequency. The same observation applies to tacrolimus, another significant immunosuppressant. The controlling variant, rs776746 (CYP3A5 � 3), shows large variation in major MVP groups (0.07, 0.70, 0.21) and there is a significant difference between HIS and local ancestry derived AMR allele frequency (0.31). Thus, recent demographic history of individuals, and the fraction of inheritance derived from different major population groups, has a large impact on the allele frequency distribution of pharmacogenomics relevant variants.
CYP2D6 is an important component of cytochrome P450 and is involved in the metabolism of many commonly prescribed medications, including antidepressants, antipsychotics, betablockers, opioids, antiemetics, atomoxetine, and tamoxifen [9,11]. In addition to SNV frequencies for the most significant variants [12], Fig 2 presents allele frequencies for an important copy number variant, the whole gene deletion designated as CYP2D6 � 5 in the pharmacogenomics literature (Figs 2 and 3). We called the CYP2D6 � 5 CNV using UMAP, a machine learning algorithm [13]. Assignments are clearly separated for copy gain and copy loss. Furthermore, we can clearly separate single and double copy loss (Fig 3). The UMAP approach offers a clear advantage over classification based on Principal Components Analysis (PCA, S2 Fig). We note that UMAP does not represent a general approach to copy number variation detection. Hyper-parameters for the model are tuned for the specific, relatively common CNVs. Furthermore, we achieve optimal performance only when we tune the model separately for individual HARE ancestries.   Table 1A presents CYP2D6 � 5 allele frequencies for the three major HARE groups. The major survey of CYP2D6 � 5 [14] finds slightly different allele frequencies, e.g., 89% in Beoris et al vs 78% in MVP for copy number 2 EUR. Our findings are closer to the frequencies reported by gnomAD (80%, MCNV_22_1026 | gnomAD SVs v2.1 | gnomAD (broadinstitute.org). The differences might be due to different assays: single site PCR vs SNP genotyping (MVP) or sequencing (gno-mAD), or differences in ascertainment of ethnic background. We are not able to run the UMAP algorithm on phased chromosomes, but we can use ancestry deconvolution and test CNV status in individuals ancestry-homozygous at CYP2D6, e.g. AMR/AMR individuals in the HARE HIS group. Results are shown in Table 1B. The ancestral AMR genome harbors fewer single-copy samples while the EUR tracks are in close agreement with observations in the EUR HARE cohort.
In addition to SNVs and specific CNVs we derived HLA alleles from SNP genotypes using the HIBAG algorithm [15] (HLA Imputation using attribute BAGging). Although HLA status Results are shown just for the HARE AFR cohort; clusters were derived using UMAP [13].
https://doi.org/10.1371/journal.pone.0274339.g003 does not modify pharmacokinetics there are well established adverse drug reactions in the presence of specific HLA alleles. For example, abacavir, a common anti-retroviral, causes abacavir hypersensitivity syndrome in the presence of HLA-B � 5701; Allopurinol is typically a safe drug for the treatment of gout but in the presence of HLA-B � 5801 is associated with an increased risk for allopurinol induced severe cutaneous adverse drug reaction (SCAR) with most serious cases developing Stevens-Johnson syndrome and toxic epidermal necrolysis (SJS/TEN) [16]. HLA 4-digit Class I and Class II allele distribution for four HARE groups is shown in Fig 4. As expected, allele frequencies are highly variable in the four groups, including the three alleles most relevant for pharmacogenomics: HLA-A � 3101, HLA-B � 5701 and HLA-B � 58:01 ( Table 2). Details of HLA allele imputation will be presented elsewhere. Here, we note that HLA imputation precision was >90% for HARE EUR, AFR and HIS groups. However, we currently observe lower precision for the ASN predictions due to lack of an appropriate training set.

Discussion
We present a survey of pharmacogenetics relevant variants in the MVP, a sample representative of the US population. Using the MVP-1 Axiom array we can resolve a large fraction of known pharmacogenomics alleles, either as direct or as imputed genotypes. In addition, we use the genotypes to derive population allele frequencies for an important common CNV, whole gene deletion/duplication of CYP2D6, as well as population distribution of HLA alleles, including HLA alleles important for drug delivery decisions. As expected, there is substantial variation in allele frequencies between ancestry groups for a subset of the examined variants. In addition to allele frequencies of individual ancestry groups (HARE EUR, AFR, HIS, ASN) we use an innovative approach, LAI, to derive allele  frequencies of ancestral genomes present in recently admixed US populations; 2-way deconvolution for HARE AFR (EUR, AFR) and 3-way deconvolution for HARE HIS (EUR, AFR, AMR). LAI conveys important information on allele frequency distribution in under-represented populations. Allele frequency is an important consideration in the formulation of clinical genotyping guidelines provided by groups such as CPIC and DPWG. The AMR track is a much better sampling of the AMR genome as it exists today in the US population compared to the small, and not necessarily representative, number of samples from AMR populations (60 vs 15,000 effective genomes). Sites with significantly different allele frequencies in AMR and EUR/AFR tracks are sites where self-identification as HIS provides limited power to guess likelihood of drug sensitivity. Thus, they are sites where groups such as CPIC and DPWG should rely on LAI minimum allele frequency rather than ethnic group allele frequency for recommendations. The large sample size we use for our analysis is particularly important for low frequency variants. For example, single and double-copy deletions of CYP2D6 are relatively rare. Therefore, it is inappropriate to derive frequencies from a reference a panel, even under the assumption that the reference panel is representative of the general US population.
There are limitations in our derived population allele frequencies. While SNVs are phased neither CNV nor HLA calls are phased genotypes. Furthermore, successful phasing in the overall genome does not guarantee successful phasing in complex genomic regions such as CYP2D6. For CYP2D6 in particular, we have been able to resolve whole gene deletions and duplications, but we are certain that there is additional small scale copy number variation that cannot be resolved by our UMAP machine learning approach. For example, small deletions and complex rearrangements involving the proximal CYP2D7 and CYP2D8 pseudogenes. It is likely that such complex variation has a minor contribution to the population distribution of CYP2D6 pharmacogenetics. However, resolution of population level frequency of such variants will require specialized assays such as long-range sequencing. Improving phasing and imputation will aid the eventual derivation of star alleles in these regions.
We think that this comprehensive allele frequency report in a population representative of the US genome diversity will become a useful reference for future guidelines of relative importance of alleles worth ascertaining in pharmacogenetics screens. High variance of allele frequency, not only among ethnic groups but most importantly among ancestral genomes contributing to mixed ancestry individuals in the US population, further underscores the need for individual typing rather than reliance on self-reported ethnicity on drug delivery decisions in clinical practice. We hope this manuscript promotes the adoption of personalized medicine in under-represented populations.

Ethics statement
The Veterans Affairs (VA) central institutional review board (cIRB) and site-specific IRBs approved the Million Veteran Program study.

MVP genotype data
The MVP Release 4 dataset includes 658,582 individuals and consists of a hard-called dataset of 667,955 variants prepared as described in Hunter-Zinck et al. 2020 [7], as well as an imputed dataset. Genotype calls passing initial quality control were further prepared for phasing and imputation by removing markers with high missingness (>20%), monomorphic markers, and markers significantly out of Hardy-Weinberg equilibrium (p < 1e-6 adjusted for ancestry). Haplotypes were then statistically phased using SHAPEIT v4.1.3 (https://odelaneau.github.io/shapeit4/) and imputed into the African Genome Resources and 1000 Genomes imputation panels using Mini-mac4 (https://genome.sph.umich.edu/wiki/Minimac4). Each individual in the cohort was assigned a HARE group (EUR, AFR, HIS, or ASN), a surrogate variable for ancestry and race/ethnicity (Fang et al. 2019

Identification of known pharmacogenetics variants
We curated a catalogue of known or high-confidence pharmacogenetics variants by rsID from the PharmGKB and PharmVar databases. From PharmGKB, we downloaded variant summary data (https://api.pharmgkb.org/v1/download/file/data/variants.zip) and kept only variants with at least one Level 1 or 2 PharmGKB clinical annotation. From PharmVar, we downloaded the complete database (version 4.2.4) and kept all variants. In total, we identify 1,339 unique variants from 152 genes.

Identification of pharmacogenetics variants in the MVP genotype dataset
Genotyped dataset. We selected the intersection of known pharmacogenetics variants with the catalog of SNPs in the MVP array. We identified pharmacogenetics variants by chromosome location and rsID [7]. Imputed dataset. Imputation was performed using MINIMAC. We kept only variants with imputation R 2 > 0.9 within the ethnic group. We assigned rsIDs to imputed variants by intersecting variant genomic position with rsID genomic position in NCBI dbSNP (v154) using bedtools. We then identified pharmacogenetics variants by overlapping imputed variant rsIDs.
In total, we find 193 pharmacogenetics variants from 136 genes in the genotyped data set. Including the imputed variants, we expand the set to 273 variants in 148 genes. If we relax the selection criteria to include all imputed variants that satisfy imputation R 2 > 0.9 in any one of the 4 HARE groups (EUR, AFR, HIS, ASN) we expand the set to 408 variants.

Allele frequency analyses
Calculation of minor allele frequencies. HARE group-specific minor allele frequencies (MAFs) were calculated for the MVP hard-called and imputed datasets using PLINK2.
Local Ancestry Inference (LAI) based allele frequencies. Briefly, we performed LAI using rfmix2. We used 3,942 reference samples for EUR, AFR and Native American (AMR) ancestry collected by the 1000 genome project and the Human Genome Diversity Project (HGDP). The reference VCF files were curated by the gnomAD team (https://gnomad.broadinstitute.org/ downloads#v3-hgdp-1kg). We used local ancestry output to create separate, ancestry specific, VCF output files. Two files for the HARE AFR sample (EUR-AFR) and three files for the HARE Hispanic sample (EUR-AFR-NAT). The allele frequency extraction procedure was the same for LAI and gnomAD samples, described below.

Analysis and visualization
Visualization of MAFs, and calculation and visualization of MAF differences between MVP and 1000 Genomes, was performed using R.

HLA type predictions
4-digit HLA type predictions were generated for HLA-A and HLA-B from hard-called genotype data using HIBAG [15]. We chose the pre-fit Affymetrix Axiom UK Biobank Array 4-digit resolution model (https://hibag.s3.amazonaws.com/hlares_index.html), as the MVP genotyping array covers > 95% of this model's training variants for both loci. We used the European model for individuals assigned to HARE group EUR and the multi-ethnic model for individuals assigned to HARE groups AFR, HIS, and ASN. Predictions were generated by calling the predict() function from HIBAG. Frequencies were calculated for each 4-digit allele by HARE group.

S1 Fig. Allele frequency comparisons between gnomAD and MVP HARE groups for three groups (EUR, AFR, HIS).
For all three, correlation with gnomAD allele frequencies is high (R 2 >0.99). In the lower right we compare allele frequencies for gnomAD HIS and Local Ancestry Inference (LAI) derived allele frequencies for the AMR track of the HARE HIS group (R 2 = 0.91). We use three-way local ancestry deconvolution (EUR, AFR, AMR).